Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ginkgo parallel test node CI timeout debugging #3792

Conversation

amitkrout
Copy link
Contributor

@amitkrout amitkrout commented Aug 20, 2020

What type of PR is this?
/kind failing-test

What does does this PR do / why we need it:
Recently we have Ginkgo timed out waiting for all parallel nodes to report back! error in most of our periodic job and in pr job too in devfile integration test.

May be this happens due to bad signal handling between ginkgo and the spec or may be test stuck in between. So adding -debug flag to get more information on the failure node.

As part of test timeout is concerned, the current timeout TIMEOUT ?= 7200s is enough for devfile test run. Anyway if we find in CI debug log 2hrs is not enough then we will increase it.

NOTE: I have tried locally and none of the time i am able to reproduce it.

Which issue(s) this PR fixes:

Fixes NA

PR acceptance criteria:

  • Unit test : NA

  • Integration test : NA

  • Documentation : NA

How to test changes / Special notes to the reviewer:
debug flag should report correct information on the failure node

@openshift-ci-robot
Copy link
Collaborator

@amitkrout: The label(s) kind/test-debugging cannot be applied, because the repository doesn't have them

In response to this:

What type of PR is this?
/kind test-debugging

What does does this PR do / why we need it:
Recently we have Ginkgo timed out waiting for all parallel nodes to report back! error in most of our periodic job and in pr job too in devfile integration test.

May be this happens due to bad signal handling between ginkgo and the spec or may be test stuck in between. So adding -debug flag to get more information on the failure node.

As part of test timeout is concerned, the current timeout TIMEOUT ?= 7200s is enough for devfile test run. Anyway if we find in CI debug log 2hrs is not enough then we will increase it.

NOTE: I have tried locally and none of the time i am able to reproduce it.

Which issue(s) this PR fixes:

Fixes NA

PR acceptance criteria:

  • Unit test : NA

  • Integration test : NA

  • Documentation : NA

How to test changes / Special notes to the reviewer:
debug flag should report correct information on the failure node

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign mik-dass
You can assign the PR to them by writing /assign @mik-dass in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov
Copy link

codecov bot commented Aug 20, 2020

Codecov Report

Merging #3792 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #3792   +/-   ##
=======================================
  Coverage   44.29%   44.29%           
=======================================
  Files         139      139           
  Lines       13400    13400           
=======================================
  Hits         5936     5936           
  Misses       6882     6882           
  Partials      582      582           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c3093e2...4a6ebfa. Read the comment docs.

@amitkrout
Copy link
Contributor Author

/test all

@amitkrout amitkrout force-pushed the ginkgoParallelTestNodeCiTimeoutDebugging branch from c1a96a4 to 85c4e7a Compare August 20, 2020 21:21
@amitkrout
Copy link
Contributor Author

amitkrout commented Aug 21, 2020

The last successfully test run was on 18th Aug 3:30 PM UTC after that our test fails due to platform and the timeout flake Ginkgo timed out waiting for all parallel nodes to report back!.

ginkgo --debug flag could help much, however there is an ongoing pr ginkgo onsi/ginkgo#461 to get proper debug output for parellel test note that stuck/failed due to wrong signal.

So its hard to find out the reason for failure with -debug or -stream flag. Now i am going to comment the test commit after 18th Aug 3:30 PM UTC to narrow down the spec that causes the failure.

@amitkrout amitkrout force-pushed the ginkgoParallelTestNodeCiTimeoutDebugging branch from 85c4e7a to 2ddf0a6 Compare August 21, 2020 04:56
@prietyc123
Copy link
Contributor

@amitkrout can we track this in a separate issue. It seems this issue needs good amount of time to investigate. However I suspect it be happening with devfile watch. One more hit - https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_odo/3763/pull-ci-openshift-odo-master-v4.5-integration-e2e/1296528880043036672#1:build-log.txt%3A682

@prietyc123
Copy link
Contributor

On travis also:

The command make test-cmd-devfile-log exited with 2.
Log:
ginkgo  -randomizeAllSpecs -slowSpecThreshold=120 -timeout 7200s -nodes=2 -focus="odo devfile log command tests" tests/integration/devfile/
Running Suite: Devfile Suite
============================
Random Seed: 1597991326 - Will randomize all specs
Will run 169 specs
Running in parallel across 2 nodes
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS•SSSSSS•SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS•SSSS
	 -------------------------------------------------------------------
	|                                                                   |
	|  Ginkgo timed out waiting for all parallel nodes to report back!  |
	|                                                                   |
	 -------------------------------------------------------------------
 devfile timed out. path: ./tests/integration/devfile
Ginkgo ran 1 suite in 58.660383924s
Test Suite Failed
Makefile:283: recipe for target 'test-cmd-devfile-log' failed
make: *** [test-cmd-devfile-log] Error 1
/home/travis/.travis/functions: line 607: 25335 Terminated              travis_jigger "${!}" "${timeout}" "${cmd[@]}"
The command "travis_wait make test-cmd-devfile-log" exited with 2.

@kadel
Copy link
Member

kadel commented Aug 21, 2020

Is it always failing on the same test spec?

@amitkrout
Copy link
Contributor Author

None of the time i am able to reproduce out of 20 tires locally. Seems its specific to CI failure

@amitkrout
Copy link
Contributor Author

amitkrout commented Aug 21, 2020

Is it always failing on the same test spec?

Most of the time we are getting a generic failure message like
Screen Shot 2020-08-21 at 9 57 19 AM

But there is a one instance where it gives bit more information. For example it capture the test file name (devfile watch test)
Screen Shot 2020-08-21 at 1 17 05 PM

Right now i am not sure from which spec the error is coming from

EDIT: As per the travis log it says the failure is coming from test-cmd-devfile-log. I will run CI test on this file too to to get more information

@amitkrout
Copy link
Contributor Author

/test all

@kadel
Copy link
Member

kadel commented Aug 21, 2020

It looks like it started with this PR #3733 Could something in this PR cause this?

@amitkrout
Copy link
Contributor Author

amitkrout commented Aug 21, 2020

It looks like it started with this PR #3733 Could something in this PR cause this?

Yes, i am able to reproduce it locally with test-cmd-devfile-log after 3 tries

$ make test-cmd-devfile-log
ginkgo  -randomizeAllSpecs -slowSpecThreshold=120 -timeout 7200s -debug -nodes=2 -focus="odo devfile log command tests" tests/integration/devfile/
Running Suite: Devfile Suite
============================
Random Seed: 1597996972 - Will randomize all specs
Will run 169 specs

Running in parallel across 2 nodes

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS•SSSSSSSSSSSSSSSSSS•SSSSSSSSSSSSSSS•SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
         -------------------------------------------------------------------
        |                                                                   |
        |  Ginkgo timed out waiting for all parallel nodes to report back!  |
        |                                                                   |
         -------------------------------------------------------------------

 devfile timed out. path: ./tests/integration/devfile

Ginkgo ran 1 suite in 1m21.305233171s
Test Suite Failed
make: *** [test-cmd-devfile-log] Error 1

@adisky adisky mentioned this pull request Aug 21, 2020
4 tasks
@amitkrout
Copy link
Contributor Author

$ TEST_EXEC_NODES=1 make test-cmd-devfile-log
ginkgo  -randomizeAllSpecs -slowSpecThreshold=120 -timeout 7200s -debug -nodes=1 -focus="odo devfile log command tests" tests/integration/devfile/
Running Suite: Devfile Suite
============================
Random Seed: 1597997941 - Will randomize all specs
Will run 4 of 169 specs

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS•SSSSSSSSSS•SSSSSSSSSSSSSSSSSSSSSSSSSpanic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1003d7e]

goroutine 85 [running]:
bytes.IndexByte(...)
        /usr/local/Cellar/go/1.13.7/libexec/src/bytes/bytes.go:91
bytes.(*Buffer).readSlice(0xc0003064b0, 0xc0000bc00a, 0xc0000bc030, 0x1, 0x0, 0x0, 0xc0000bc030)
        /usr/local/Cellar/go/1.13.7/libexec/src/bytes/buffer.go:418 +0x70
bytes.(*Buffer).ReadString(...)
        /usr/local/Cellar/go/1.13.7/libexec/src/bytes/buffer.go:437
github.com/openshift/odo/tests/helper.RunCmdWithMatchOutputFromBuffer.func1(0xc0003064b0, 0xc0006fc060, 0x2272521, 0xe, 0xc0006fc000)
        /Users/amit/go/src/github.com/openshift/odo/tests/helper/helper_generic.go:202 +0x4e
created by github.com/openshift/odo/tests/helper.RunCmdWithMatchOutputFromBuffer
        /Users/amit/go/src/github.com/openshift/odo/tests/helper/helper_generic.go:200 +0x27c

Ginkgo ran 1 suite in 1m11.555152378s
Test Suite Failed
make: *** [test-cmd-devfile-log] Error 1

ping @adisky
This info could be helpful

@openshift-ci-robot
Copy link
Collaborator

@amitkrout: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/v4.5-integration-e2e 2947f2c link /test v4.5-integration-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@amitkrout
Copy link
Contributor Author

cause identified and will be resolved via #3808
/close

@openshift-ci-robot
Copy link
Collaborator

@amitkrout: Closed this PR.

In response to this:

cause identified and will be resolved via #3808
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rm3l rm3l added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. Required by Prow. label Jun 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. Required by Prow.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants